We propose a framework for the exact probabilistic analysis of window-based 
pattern matching algorithms, such as Boyer-Moore, Horspool, Backward 
DAWG Matching, Backward Oracle Matching, and more. In particular, we 
show how to efficiently obtain the distribution of such an algorithm's run- 
ning time cost for any given pattern in a random text model, which can be 
quite general, from simple uniform models to higher-order Markov models 
or hidden Markov models (HMMs). Furthermore, we provide a technique 
to compute the exact distribution of differences in running time cost of two 
algorithms. In contrast to previous work, our approach is neither limited to 
simple text models, nor to asymptotic statements, nor to moment computa- 
tions such as expectation and variance. Methodically, we use extensions of 
finite automata which we call deterministic arithmetic automata (DAAs) and 
probabilistic arithmetic automata (PAAs) [13]. To our knowledge, this is the 
first time that substring- or suffix-based pattern matching algorithms are an- 
alyzed exactly. Experimentally, we compare Horspool's algorithm, Backward 
DAWG Matching, and Backward Oracle Matching on prototypical patterns 
of short length and provide statistics on the size of minimal DAAs for these 
computations. 



1 Introduction 

The basic pattern matching problem is to find all occurrences of a pattern string in 
a (long) text string, with few character accesses. Let $\ell$ be the text length and $m$ be 
the pattern length. The well-known Knuth-Morris-Pratt algorithm [9] reads each text 
character exactly once from left to right and hence needs exactly $\ell$ character accesses 
for any text of length $\ell$, after preprocessing the pattern in $O(m)$ time. In contrast, 
the Boyer-Moore g], Horspool [8], Sunday [32], Backward DAWG Matching (BDM, 0) 
and Backward Oracle Matching (BOM, [1]) algorithms move a length-$m$ search window 
across the text and first compare its last character to the last character of the pattern. 
This often allows moving the search window by more than one position (at best, by $m$ 
positions if the last window character does not occur in the pattern at all), for a best 
case of $n/m$, but a worst case of $mn$ character accesses. The worst case can often be 
improved to $O(m + n)$, but this makes the code more complicated and seldom provides 
a speed-up in practice. For practical pattern matching applications, the most important 
algorithms are Horspool, BDM (often implemented as Backward Nondeterministic 
DAWG Matching, BNDM, via a non-deterministic automaton that is simulated in a 
bit-parallel fashion), and BOM, depending on alphabet size, text length and pattern 
length; see [16, Section 2.5] for an experimental map. 

A question that has apparently not been investigated so far concerns the exact 
probability distribution of the number of required character accesses $X^p_\ell$ when matching a 
given pattern $p$ against a random text of finite length $\ell$ (non-asymptotic case), even 
though related questions have been answered in the literature. For example, [2, 3] 
analyze the expected value of $X^p_\ell$ for the Horspool algorithm. In [12] it is further shown that 
for the Horspool algorithm, $X^p_\ell$ is asymptotically normally distributed for i.i.d. texts, 
and [18] extends this result to Markovian text models. In [20], a method to compute 
mean and variance of these distributions is given. 

In contrast to these results that are special to the Horspool algorithm, we use a general 
framework called probabilistic arithmetic automata (PAAs), introduced at CPM'08 [13], 
to compute the exact distribution of $X^p_\ell$ for any window-based pattern matching 
algorithm. In [13], PAAs were introduced in order to compute the distribution of occurrence 
counts of patterns; the fact that they can also be used to analyze pattern matching 
algorithms further highlights their utility. In our framework, the random text model can be 
quite general, from simple i.i.d. uniform models to high-order Markov models or HMMs. 
The approach is applied exemplarily to the following pattern matching algorithms in 
the non-asymptotic regime (short patterns, medium-length texts): Horspool, B(N)DM, 
BOM. We do not treat BDM and BNDM separately as, in terms of text character 
accesses, they are indistinguishable (see Section 2.2). 

This paper is organized as follows. In the next section, we introduce notation and give 
a brief review of the Horspool, B(N)DM and BOM algorithms. In Section 3, we define 
deterministic arithmetic automata (DAAs). In Section 4, we present a simple general 
DAA construction for the analysis of window-based pattern matching algorithms. We 
also show that the state space of the DAA can be considerably reduced by adapting DFA 
minimization to DAAs. In Section 5, we summarize the PAA framework with its generic 
algorithms, define finite-memory text models and connect DAAs to PAAs. This yields, 
for each given pattern $p$, algorithm, and random text model, a PAA that computes the 
distribution of $X^p_\ell$ for any finite text length $\ell$. Section 6 introduces difference DAAs 
by a product construction that allows comparing two algorithms on a given pattern. 
Exemplary results on the comparison of several algorithms can be found in Section 7. 
There, we also provide statistics on automata sizes for different algorithms and pattern 
lengths. Section 8 contains a concluding discussion. 

An extended abstract of this work has been presented at LATA'10 [15] with two 
alternative DAA constructions. In contrast to that version, the DAA construction in 
the present paper can be seen as a combination of both of those, and is much simpler. 
Additionally, the DAA minimization introduced in the present paper allows the analysis 
of much longer patterns in practice. While [15] was focused on Horspool's and Sunday's 
algorithms, here, we give a general construction scheme applicable to any window-based 
pattern matching algorithm and discuss the most relevant algorithms, namely Horspool, 
BOM, and B(N)DM, as examples. 

2 Algorithms 

Both pattern and text are over a finite alphabet $\Sigma$. Indexing generally starts at zero. 
The pattern $p = p[0] \ldots p[m-1]$ is of length $m$; a (concrete) text $s$ is of length $\ell$. By $\overleftarrow{p}$, 
we denote the reverse pattern $p[m-1] \ldots p[0]$. 

In the following, we summarize the Horspool, B(N)DM and BOM algorithms; 
algorithmic details can be found in [16, Chapter 2]. 

We do not discuss the Knuth-Morris-Pratt algorithm because its number of text 
character accesses is constant: Each character of the text is looked at exactly once. Therefore, 
the distribution of $X^p_\ell$ is the Dirac distribution on $\ell$, i.e., $\mathbb{P}(X^p_\ell = \ell) = 1$. 

We also do not discuss the Boyer-Moore algorithm, since it is never the best one in 
practice because of its complicated code to achieve optimal asymptotic running time. 
In contrast to our earlier paper [15], we also skip the Sunday algorithm, as it is almost 
always inferior to Horspool's. Instead, we focus on those algorithms that are fastest in 
practice according to [16, Fig. 2.22]. 

The Horspool, B(N)DM and BOM algorithms have the following properties in 
common: They maintain a search window $w$ of length $m = |p|$ that initially starts at 
position 0 in the text $s$, such that its rightmost character is at position $t = m - 1$. 
The right window position $t$ grows in the course of the algorithm; we always have 
$w = s[(t - m + 1) \ldots t]$. The algorithms look at the characters in each window from right to 
left, and thus compare the reversed window with the reversed pattern $\overleftarrow{p}$. (For Horspool, 
variants with different comparison orders are possible, but the rightmost character is 
always compared first.) 

The two properties of an algorithm that influence our analysis are the following: For 
a pattern $p \in \Sigma^m$, each window $w \in \Sigma^m$ determines 

1. its cost $\xi^p(w)$, e.g., the number of text character accesses required to analyze this 
window, 

2. its shift $\mathrm{shift}^p(w)$, which is the number of characters the window is advanced after 
it has been examined. 

2.1 Horspool 



First, the rightmost characters of window and pattern are compared; that means, 
$a := w[m-1] = s[t]$ is compared with $p[m-1]$. If they match, the remaining $m - 1$ characters 
are compared until either the first mismatch is found or an entire match has been verified. 
This comparison can happen right-to-left, left-to-right, or in an arbitrary order that may 
depend on $p$. In our analysis, we focus on the right-to-left case for concreteness, but the 
modifications for the other cases are straightforward. Therefore, the cost of window $w$ 
is 

$$\xi^p(w) := \begin{cases} m & \text{if } p = w, \\ \min\{i : 1 \le i \le m,\ p[m-i] \ne w[m-i]\} & \text{otherwise.} \end{cases}$$

In any case, the rightmost window character $a$ is used to determine how far the window 
can be shifted for the next iteration. The shift function ensures that no match can be 
missed by moving the window such that $a$ becomes aligned to the rightmost $a$ in $p$ (not 
considering the last position). If $a$ does not occur in $p$ (or only at the last position), it 
is safe to shift by $m$ positions. Formally, we define 

$$\begin{aligned} \mathrm{right}^p(a) &:= \max\bigl[\{i \in \{0, \ldots, m-2\} : p[i] = a\} \cup \{-1\}\bigr], \\ \mathrm{shift}[a] &:= (m-1) - \mathrm{right}^p(a), \quad \text{assuming } p \text{ fixed}, \\ \mathrm{shift}^p(w) &:= \mathrm{shift}[w[m-1]]. \end{aligned}$$

For concreteness, we state Horspool's algorithm and how we count text character 
accesses as pseudocode in Algorithm 1. Note that after a shift, even when we know 
that $a$ now matches its corresponding pattern character, the corresponding position is 
compared again and counts as a text access. Otherwise the additional bookkeeping 
would make the algorithm more complicated; this is not worth the effort in practice. 
The lookup in the shift table does not count as an additional access, since we can 
remember $\mathrm{shift}[a]$ as soon as the last window character has been read. 

The main advantage of the Horspool algorithm is its simplicity. In particular, a window's 
shift value depends only on its last character, and its cost is easily computed from the 
number of consecutive matching characters at its right end. The Horspool algorithm 
does not require any advanced data structure and can be implemented in a few lines of 
code. 

2.2 Backward (Nondeterministic) DAWG Matching, B(N)DM 

The main idea of the BDM algorithm is to build a deterministic finite automaton (in 
this case, a suffix automaton, which is a directed acyclic word graph or DAWG) that 
recognizes all substrings of the reversed pattern, accepts all suffixes of the reversed 
pattern (including the empty suffix), and enters a FAIL state if a string has been read 
that is not a substring of the reversed pattern. 

The suffix automaton processes the window right-to-left. As long as the FAIL state 
has not been reached, we have read a substring of the reversed pattern. If we are in an 
accepting state, we have even found a suffix of the reversed pattern (i.e., a prefix of $p$). 
Algorithm 1 HORSPOOL-WITH-COST 
Input: text s ∈ Σ*, pattern p ∈ Σ^m 
Output: pair (number occ of occurrences of p in s, number cost of accesses to s) 
pre-compute table shift[a] for all a ∈ Σ 
(occ, cost) ← (0, 0) 
t ← m − 1 
while t < |s| do 
    i ← 0 
    while i < m do 
        cost ← cost + 1 
        if s[t − i] ≠ p[(m − 1) − i] then break 
        i ← i + 1 
    if i = m then occ ← occ + 1 
    t ← t + shift[s[t]] 
return (occ, cost) 
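As a minimal sketch, Algorithm 1 can be transcribed to Python as follows. The function name `horspool_with_cost` and the dictionary-based shift table are choices made for this illustration only; the pattern is assumed to be non-empty.

```python
def horspool_with_cost(s, p):
    """Return (occ, cost): occurrences of p in s and text character accesses."""
    m = len(p)
    # shift[a] = (m - 1) - right(a), where right(a) is the rightmost index of a
    # in p[0..m-2]; characters not occurring there get the default shift m.
    shift = {}
    for i in range(m - 1):
        shift[p[i]] = (m - 1) - i  # a later (more rightward) i overwrites
    occ = cost = 0
    t = m - 1  # right end of the current window
    while t < len(s):
        i = 0
        while i < m:
            cost += 1  # one access to s[t - i]
            if s[t - i] != p[(m - 1) - i]:
                break
            i += 1
        if i == m:
            occ += 1
        t += shift.get(s[t], m)
    return occ, cost
```

For example, `horspool_with_cost("abcabcabc", "abc")` examines three windows with three accesses each and returns `(3, 9)`.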



Whenever this happens before we have read $m$ characters, the last such event marks the 
next potential window start that may contain a match with $p$, and hence determines the 
shift. When we are in an accepting state after reading $m$ characters, we have found a 
match, but this does not influence the shift. 

So, $\xi^p(w)$ is the number of characters read when entering FAIL, or $m$ if $p = w$. Let 
$I^p(w) \subseteq \{0, \ldots, m-1\}$ be the set defined by $i \in I^p(w)$ if and only if the suffix automaton 
of $\overleftarrow{p}$ is in an accepting state after reading $i$ characters of $w$. Then 

$$\mathrm{shift}^p(w) = \min\{m - i \mid i \in I^p(w)\}.$$

Note that $I^p(w)$ is never empty as the suffix automaton accepts the empty string and, 
thus, $0 \in I^p(w)$ for all windows $w$. 

The advantage of BDM is its long shifts, but its main disadvantage is the necessary 
construction of the suffix automaton, which is possible in $O(m)$ time via the suffix tree 
of the reversed pattern, but too expensive in practice to compete with other algorithms 
unless the search text is extremely long. 

Constructing a nondeterministic finite automaton (NFA) instead of the deterministic 
suffix automaton is much simpler. Processing a text character then no longer takes 
constant time, but $O(m)$ time. However, the NFA can be efficiently simulated with 
bit-parallel operations such that processing a text character takes $O(m/W)$ time, where $W$ 
is the machine word size. For many patterns in practice, this is as good as $O(1)$. The 
resulting algorithm is then called BNDM. 

From the "text character accesses" analysis point of view, BDM and BNDM are 
equivalent, as they have the same shift and cost functions. 
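The cost and shift functions of B(N)DM can also be evaluated directly from their definitions, without building a suffix automaton. The following sketch is an illustration under that assumption: it uses naive substring tests (so it is much slower than the automaton), and the name `bdm_cost_and_shift` is hypothetical. It relies on two observations: the last $j$ window characters, read right-to-left, form a substring of the reversed pattern exactly when the length-$j$ suffix of $w$ is a substring of $p$, and the automaton is in an accepting state after $i$ characters exactly when the length-$i$ suffix of $w$ is a prefix of $p$.

```python
def bdm_cost_and_shift(p, w):
    """Return (cost, shift) of window w for pattern p under B(N)DM."""
    m = len(p)
    assert len(w) == m
    # Cost: number of characters read until the read suffix of w is no longer
    # a substring of p (FAIL), or m if the whole window was read (then w == p).
    cost = m
    for j in range(1, m + 1):
        if w[m - j:] not in p:
            cost = j
            break
    # I = {i in 0..m-1 : the length-i suffix of w is a prefix of p}; i = 0
    # (empty suffix) always qualifies, so the maximum exists.
    longest = max(i for i in range(m) if p.startswith(w[m - i:]))
    return cost, m - longest  # shift = min{m - i : i in I}
```

For instance, for `p = "abab"` the window `"xxab"` is abandoned after three accesses and shifted by two positions, so the function returns `(3, 2)`.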
2.3 Backward Oracle Matching, BOM 

BOM is similar to BDM, but the suffix automaton of the reversed pattern is replaced by 
a simpler deterministic automaton, the factor oracle (rarely also called suffix oracle). It 
may recognize (accept) more strings than substrings (suffixes) of the reversed pattern, 
but is much easier to construct. It still guarantees that, once the FAIL state is reached, 
the sequence of read characters is not a substring of the reversed pattern. 

The cost and shift functions are defined as for BDM, but based on the oracle. We refer 
to [12] for the construction details and further properties of the oracle. By construction, 
BOM never gives longer shifts than B(N)DM. The main advantage of BOM over BDM 
is reduced space usage and preprocessing time; the factor oracle only has m + 1 states 
and can be constructed faster than a suffix automaton. 

3 Deterministic Arithmetic Automata 

In this section, we introduce deterministic arithmetic automata (DAAs). They extend 
ordinary deterministic finite automata (DFAs) by performing a computation while one 
moves from state to state. Even though DAAs can be shown to be formally equivalent 
to families of DFAs on an appropriately defined larger state space, they are a useful 
concept before introducing probabilistic arithmetic automata (PAAs) and allow us to 
construct PAAs for the analysis of pattern matching algorithms in a simpler way. 

Definition 1 (Deterministic Arithmetic Automaton, DAA). A deterministic arithmetic 
automaton is a tuple 

$$\mathcal{D} = \bigl(Q, q_0, \Sigma, \delta, \mathcal{V}, v_0, \mathcal{E}, (\eta_q)_{q \in Q}, (\theta_q)_{q \in Q}\bigr),$$

where $Q$ is a finite set of states, $q_0 \in Q$ is the start state, $\Sigma$ is a finite alphabet, 
$\delta : Q \times \Sigma \to Q$ is a transition function, $\mathcal{V}$ is a finite or countable set of values, $v_0 \in \mathcal{V}$ is 
called the start value, $\mathcal{E}$ is a finite set of emissions, $\eta_q \in \mathcal{E}$ is the emission associated to 
state $q$, and $\theta_q : \mathcal{V} \times \mathcal{E} \to \mathcal{V}$ is a binary operation associated to state $q$. 

Informally, a DAA starts with the state-value pair $(q_0, v_0)$ and reads a sequence of 
symbols from $\Sigma$. Being in state $q$ with value $v$, upon reading $a \in \Sigma$, the DAA performs 
a state transition to $q' := \delta(q, a)$ and updates the value to $v' := \theta_{q'}(v, \eta_{q'})$ using the 
operation and emission of the new state $q'$. 

Further, we define the associated joint transition function 

$$\hat\delta : (Q \times \mathcal{V}) \times \Sigma \to (Q \times \mathcal{V}), \qquad \hat\delta\bigl((q,v),a\bigr) := \bigl(\delta(q,a),\ \theta_{\delta(q,a)}(v, \eta_{\delta(q,a)})\bigr).$$

As usual, we extend the definition of $\hat\delta$ inductively from $\Sigma$ to $\Sigma^*$ in its second argument 
by $\hat\delta((q,v),\varepsilon) := (q,v)$ for the empty string $\varepsilon$ and $\hat\delta((q,v),xa) := \hat\delta(\hat\delta((q,v),x),a)$ for 
all $x \in \Sigma^*$ and $a \in \Sigma$. When $\hat\delta((q_0,v_0),s) = (q,v)$ for some $q \in Q$ and $s \in \Sigma^*$, we say 
that $\mathcal{D}$ computes value $v$ for input $s$ and define $\mathrm{value}_{\mathcal{D}}(s) := v$. 

For each state $q$, the emission $\eta_q$ is fixed and could be dropped from the definition of 
DAAs. In fact, one could also dispense with values and operations entirely and define a 
DFA over state space $Q \times \mathcal{V}$, performing the same operations as a DAA. However, we 
intentionally include values, operations, and emissions to emphasize the connection to 
PAAs (which are defined in Section 5). 

As a simple example for a DAA, take a standard DFA $(Q, q_0, \Sigma, \delta, F)$ with $F \subseteq Q$ 
being a set of final (or accepting) states. To obtain a DAA that counts how many times 
the DFA visits an accepting state when reading $x \in \Sigma^*$, let $\mathcal{E} := \{0, 1\}$ and define $\eta_q := 1$ 
if $q \in F$, and $\eta_q := 0$ otherwise. Further define $\mathcal{V} := \mathbb{N}$ with $v_0 := 0$, and let the operation 
in each state be the usual addition: $\theta_q(v,e) := v + e$ for all $q$. Then $\mathrm{value}_{\mathcal{D}}(x)$ is the 
desired count. 
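The counting example can be made concrete with a few lines of Python. This is a sketch with explicit tables; the function name `run_daa` and the particular three-state DFA (which enters its accepting state whenever the text read so far ends in "ab") are assumptions chosen for illustration.

```python
def run_daa(delta, q0, v0, emission, operation, text):
    """Return the value computed by a DAA, given by explicit tables, on text."""
    q, v = q0, v0
    for a in text:
        q = delta[(q, a)]                 # state transition
        v = operation[q](v, emission[q])  # operation and emission of the NEW state
    return v


# DFA over {a, b}: state 2 is entered exactly when the text read so far ends in "ab".
delta = {(0, "a"): 1, (0, "b"): 0,
         (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 0}
accepting = {2}
emission = {q: int(q in accepting) for q in (0, 1, 2)}   # eta_q = 1 iff q is accepting
operation = {q: (lambda v, e: v + e) for q in (0, 1, 2)}  # addition in every state

count = run_daa(delta, 0, 0, emission, operation, "abab")
```

Here `count` equals the number of visits to the accepting state, i.e. the number of occurrences of "ab" in the input.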



4 Constructing DAAs for Pattern Matching Analysis 

For a given algorithm and pattern $p \in \Sigma^m$ with known shift and cost functions, 
$\mathrm{shift}^p : \Sigma^m \to \{1, \ldots, m\}$, $w \mapsto \mathrm{shift}^p(w)$ and $\xi^p : \Sigma^m \to \mathbb{N}$, $w \mapsto \xi^p(w)$, we construct a DAA 
that upon reading a text $s \in \Sigma^*$ computes the total cost, defined as the sum of costs 
over all examined windows. (Which windows are examined depends of course on the 
shift values of previously examined windows.) Slightly abusing notation, we write $\xi^p(s)$ 
for the total cost incurred on $s$. 

While different constructions are possible (see also [15]), the construction presented 
here has the advantage that it is simple to describe and implement and processes only 
one text character at a time. This property allows the construction of a product DAA 
that directly compares two algorithms as detailed in Section 6. 

We define a DAA by 

• $Q := \Sigma^m \times \{0, \ldots, m\}$, 

• $q_0 := (p, m)$, 

• $\mathcal{V} := \mathbb{N}$, 

• $v_0 := 0$, 

• $\mathcal{E} := \{0, \ldots, m\}$, 

• $\eta_{(w,x)} := \begin{cases} 0 & \text{if } x > 0, \\ \xi^p(w) & \text{if } x = 0, \end{cases}$ 

• $\theta_q : (v, e) \mapsto v + e$ for all $q \in Q$ (addition), 

• $\delta : \bigl((w,x), a\bigr) \mapsto \begin{cases} (w'a,\ x - 1) & \text{if } x > 0, \\ (w'a,\ \mathrm{shift}^p(w) - 1) & \text{if } x = 0, \end{cases}$ 

where $w'$ is the length-$(m-1)$ suffix of $w$, i.e., $w' := w[1] \ldots w[m-1]$. 

Informally, the state $q = (w, x)$ means that the last $m$ read characters spell $w$ and 
that $x$ more characters need to be read to get to the end of the current window. For 
the start state $(p, m)$, the component $p$ is arbitrary, as we need to read $m$ characters to 
reach the end of the first window. The value accumulates the cost of examined windows. 
Therefore, the operation is a simple addition in each state, and the emission of state 
$(w, x)$ specifies the cost to add. Consequently, the emission is zero if the state does not 
correspond to an examined window ($x > 0$), and the emission equals the window cost 
$\xi^p(w)$ if $x = 0$. The transition function $\delta$ specifies how to move from one state to the next 
when reading the next text character $a \in \Sigma$: In any case, the window content is updated 
by forgetting the first character and appending the read $a$. If the end of the current 
window has not been reached ($x > 0$), the counter $x$ is decremented. Otherwise, the 
window's shift value is used to compute the number of characters until the next window 
aligns. 
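To illustrate the construction, the following sketch simulates the DAA on a concrete text without materializing its state space: the state $(w, x)$ is kept in two variables and updated one character at a time. Horspool's cost and shift functions are plugged in as an example; all function names are hypothetical and chosen for this illustration.

```python
def horspool_cost(p, w):
    """Window cost for Horspool with right-to-left comparison."""
    m = len(p)
    for i in range(1, m + 1):
        if p[m - i] != w[m - i]:
            return i
    return m


def horspool_shift(p, w):
    """Horspool shift; depends only on the last window character."""
    m = len(p)
    right = p.rfind(w[-1], 0, m - 1)  # rightmost index in p[0..m-2], or -1
    return (m - 1) - right


def daa_total_cost(p, text, cost, shift):
    """Total cost computed by the window DAA with state (w, x)."""
    m = len(p)
    w, x = p, m  # start state (p, m); the w-component is arbitrary here
    value = 0    # start value
    for a in text:
        # Transition: the counter is decremented, or reset from the shift of
        # the window that has just been examined (the old w).
        x = x - 1 if x > 0 else shift(p, w) - 1
        w = w[1:] + a  # forget the first character, append a
        if x == 0:     # emission of the new state: window cost, else zero
            value += cost(p, w)
    return value
```

On the text "abcabcabc" with pattern "abc", the simulation adds the cost 3 for each of the three examined windows, so `daa_total_cost("abc", "abcabcabc", horspool_cost, horspool_shift)` gives 9, the same number of accesses that a direct run of Horspool's algorithm performs.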

Theorem 1. With the DAA $\mathcal{D}$ constructed as above, $\mathrm{value}_{\mathcal{D}}(s) = \xi^p(s)$ for all $s \in \Sigma^*$. 

Proof. The total cost $\xi^p(s)$ can be written as the sum of costs of all processed windows: 
$\xi^p(s) = \sum_{i \in I_s} \xi^p(s[i-m+1 \ldots i])$, where $I_s$ is the set of indices giving the processed 
windows, i.e. $I_s \subseteq \{m-1, \ldots, |s|-1\}$ such that 

$$i \in I_s \ :\Longleftrightarrow\ i = m-1 \ \text{ or } \ \exists j \in I_s : i = j + \mathrm{shift}^p(s[j-m+1 \ldots j]).$$

We have to prove that the DAA computes this value for $s \in \Sigma^*$. 

Let $(w_i, x_i)$ be the DAA state active after reading $s[..i]$. Observe that the transition 
function $\delta$ ensures that the $w_i$-component of $(w_i, x_i)$ reflects the rightmost length-$m$ 
window of $s[..i]$, which can immediately be verified inductively. Thus, the emission on 
reading the last character $s[i]$ of $s[..i]$ with $i \ge m-1$ is, by definition of $\eta_{(w_i,x_i)}$, either 
$\xi^p(s[i-m+1 \ldots i])$ or zero, depending on the second component of $(w_i, x_i)$. As the 
operation is an addition for all states, $\mathrm{value}_{\mathcal{D}}(s) = \sum_{i \in I'_s} \xi^p(s[i-m+1 \ldots i])$ for 

$$I'_s := \{i \in \{0, \ldots, |s|-1\} : x_i = 0\}.$$

It remains to show that $I_s = I'_s$. To this end, note that by $\delta$, we have $x_{i+1} = x_i - 1$ if 
$x_i > 0$ and $x_{i+1} = \mathrm{shift}^p(w_i) - 1$ if $x_i = 0$. As $q_0 = (p, m)$, it follows that $m-1 \in I'_s$. 
Using $w_i = s[i-m+1 \ldots i]$ for $i \ge m-1$, we conclude that whenever $x_i = 0$, it follows 
that $x_j = 0$ for $j = i + \mathrm{shift}^p(s[i-m+1 \ldots i])$ and that $x_{j'} > 0$ for $i < j' < j$. Hence 
we obtain that $i \in I'_s$ implies that $i + \mathrm{shift}^p(s[i-m+1 \ldots i]) \in I'_s$ and $i + k \notin I'_s$ for 
$0 < k < \mathrm{shift}^p(s[i-m+1 \ldots i])$, which completes the proof. □ 

DAA Minimization The size of the constructed DAA's state space depends expo- 
nentially on the pattern length, making the application for long patterns infeasible in 
practice. However, depending on the particular circumstances (i.e., algorithm and pat- 
tern analyzed), the constructed DAA can often be substantially reduced by applying a 
modified version of Hopcroft's algorithm for DFA minimization 0. 

Hopcroft's algorithm minimizes a DFA in $O(|Q| \log |Q|)$ time by iteratively refining a 
partition of the state set. In the beginning, all states are partitioned into two distinct 
sets: one containing the accepting states, and the other containing the non-accepting 
states. This partition is iteratively refined whenever a reason for non-equivalence of two 
states in the same set is found. Upon termination, the states are partitioned into sets 
of equivalent states. Refer to [10] for an in-depth explanation of Hopcroft's algorithm. 

The algorithm can straightforwardly be adapted to minimize DAAs by choosing the 
initial state set partition appropriately. In our case, each DAA state is associated with 
the same operation. The only differences in the states' behavior thus stem from different 
emissions. Therefore, Hopcroft's algorithm can be initialized by the partition induced 
by the emissions and then continued as usual. 

As we exemplify in Section 7, this leads to a considerable reduction of the number of 
states. 
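The idea of starting the refinement from the emission-induced partition can be sketched as follows. Note that this sketch deliberately swaps in the simpler Moore-style signature refinement instead of Hopcroft's algorithm: it yields the same partition into equivalent states, but in the worst case needs $O(|Q|^2 \cdot |\Sigma|)$ time rather than $O(|Q| \log |Q|)$. The function name and the table layout are assumptions made for this illustration, and all states are assumed to share the same operation.

```python
def minimize_daa(states, alphabet, delta, emission):
    """Map each state to a block id; states in one block are equivalent."""
    # Initial partition: group states by their emission.
    block = {q: emission[q] for q in states}
    while True:
        # Signature of a state: its own block plus the blocks of its successors.
        sig = {q: (block[q],) + tuple(block[delta[(q, a)]] for a in alphabet)
               for q in states}
        ids = {}
        new_block = {q: ids.setdefault(sig[q], len(ids)) for q in states}
        # The new partition refines the old one, so an unchanged number of
        # blocks means that the partition is stable.
        if len(ids) == len(set(block.values())):
            return new_block
        block = new_block


# Toy DAA: states 1 and 2 have equal emissions and equal successors.
states = [0, 1, 2, 3]
alphabet = ["a", "b"]
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 3, (1, "b"): 0,
         (2, "a"): 3, (2, "b"): 0, (3, "a"): 3, (3, "b"): 3}
emission = {0: 0, 1: 1, 2: 1, 3: 0}
blocks = minimize_daa(states, alphabet, delta, emission)
```

In the toy example, states 1 and 2 end up in the same block, while states 0 and 3 are separated although they have the same emission, because their successors emit differently. The minimized DAA has one state per block, and the block of the original start state serves as its start state.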

5 Probabilistic Arithmetic Automata 

This section introduces finite-memory random text models and explains how to construct 
a probabilistic arithmetic automaton (PAA) from a (minimized) DAA and a random text 
model. PAAs were introduced in [13], where they are used to compute pattern occurrence 
count distributions. Other applications in biological sequence analysis include the exact 
computation of p-values of sequence motifs [2], and the determination of seed sensitivity 
for pairwise sequence alignment algorithms based on filtering [6]. 

5.1 Random Text Models 

Given an alphabet $\Sigma$, a random text is a stochastic process $(S_t)_{t \in \mathbb{N}_0}$, where each $S_t$ takes 
values in $\Sigma$. A text model $\mathbb{P}$ is a probability measure assigning probabilities to (sets of) 
strings. It is given by (consistently) specifying the probabilities $\mathbb{P}(S_0 \ldots S_{|s|-1} = s)$ for 
all $s \in \Sigma^*$. We only consider finite-memory models in this article, which are formalized 
in the following definition. 

Definition 2 (Finite-memory text model). A finite-memory text model is a tuple 
$(\mathcal{C}, c_0, \Sigma, \varphi)$, where $\mathcal{C}$ is a finite state space (called context space), $c_0 \in \mathcal{C}$ a start context, $\Sigma$ 
an alphabet, and $\varphi : \mathcal{C} \times \Sigma \times \mathcal{C} \to [0, 1]$ a transition function with $\sum_{\sigma \in \Sigma,\, c' \in \mathcal{C}} \varphi(c, \sigma, c') = 1$ 
for all $c \in \mathcal{C}$. The random variable giving the text model state after $t$ steps is denoted $C_t$ 
with $C_0 :\equiv c_0$. A probability measure is now induced by stipulating 

$$\mathbb{P}(S_0 \ldots S_{n-1} = s,\ C_1 = c_1, \ldots, C_n = c_n) := \prod_{i=0}^{n-1} \varphi(c_i, s[i], c_{i+1})$$

for all $n \in \mathbb{N}_0$, $s \in \Sigma^n$, and $(c_1, \ldots, c_n) \in \mathcal{C}^n$. 

The idea is that the model given by $(\mathcal{C}, c_0, \Sigma, \varphi)$ generates a random text by moving 
from context to context and emitting a character at each transition, where $\varphi(c, \sigma, c')$ 
is the probability of moving from context $c$ to context $c'$ and thereby generating the 
letter $\sigma$. 

Note that the probability $\mathbb{P}(S_0 \ldots S_{|s|-1} = s)$ is obtained by marginalization over all 
context sequences that generate $s$. This can be done efficiently, using the decomposition 
of the following lemma. 

Lemma 1. Let $(\mathcal{C}, c_0, \Sigma, \varphi)$ be a finite-memory text model. Then, 

$$\mathbb{P}(S_0 \ldots S_n = s\sigma,\ C_{n+1} = c) = \sum_{c' \in \mathcal{C}} \mathbb{P}(S_0 \ldots S_{n-1} = s,\ C_n = c') \cdot \varphi(c', \sigma, c)$$

for all $n \in \mathbb{N}_0$, $s \in \Sigma^n$, $\sigma \in \Sigma$ and $c \in \mathcal{C}$. 

Proof. We have 

$$\begin{aligned} \mathbb{P}(S_0 \ldots S_n = s\sigma,\ C_{n+1} = c) &= \sum_{c_1, \ldots, c_n} \mathbb{P}(S_0 \ldots S_n = s\sigma,\ C_1 = c_1, \ldots, C_n = c_n,\ C_{n+1} = c) \\ &= \sum_{c_1, \ldots, c_n} \Bigl( \prod_{i=0}^{n-1} \varphi(c_i, s[i], c_{i+1}) \Bigr) \cdot \varphi(c_n, \sigma, c) \\ &= \sum_{c_n \in \mathcal{C}} \Bigl( \sum_{c_1, \ldots, c_{n-1}} \prod_{i=0}^{n-1} \varphi(c_i, s[i], c_{i+1}) \Bigr) \cdot \varphi(c_n, \sigma, c) \\ &= \sum_{c_n \in \mathcal{C}} \mathbb{P}(S_0 \ldots S_{n-1} = s,\ C_n = c_n) \cdot \varphi(c_n, \sigma, c). \end{aligned}$$

Renaming $c_n$ to $c'$ yields the claimed result. □ 

Similar text models are used in [11], where they are called probability transducers. In 
the following, we refer to a finite-memory text model $(\mathcal{C}, c_0, \Sigma, \varphi)$ simply as text model, 
as all text models considered in this article are special cases of Definition 2. 

For an i.i.d. model, we set $\mathcal{C} = \{\varepsilon\}$ and $\varphi(\varepsilon, \sigma, \varepsilon) = p_\sigma$ for each $\sigma \in \Sigma$, where $p_\sigma$ is 
the occurrence probability of letter $\sigma$ (and $\varepsilon$ may be interpreted as an empty context). 
For a Markovian text model of order $r$, the distribution of the next character depends 
on the $r$ preceding characters (fewer at the beginning); thus $\mathcal{C} := \bigcup_{i=0}^{r} \Sigma^i$. This notion 
of text models also covers variable order Markov chains as introduced in [17], which 
can be converted into equivalent models of fixed order. Text models as defined above 
have the same expressive power as character-emitting HMMs, that is, they allow 
constructing the same probability distributions. 
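As a small illustration of these definitions, the sketch below stores $\varphi$ as a Python dictionary and evaluates the probability of a string with the recurrence of Lemma 1, keeping one number per context. The function name, the dictionary layout, and the concrete probabilities are assumptions made for this example.

```python
def string_probability(phi, c0, s):
    """P(S_0 ... S_{n-1} = s) for a finite-memory text model.

    phi maps (context, character, next_context) to a probability;
    triples that are absent have probability zero.
    """
    forward = {c0: 1.0}  # forward[c] = P(prefix read so far, current context = c)
    for a in s:
        nxt = {}
        for (c, b, c2), pr in phi.items():
            if b == a and c in forward:
                nxt[c2] = nxt.get(c2, 0.0) + forward[c] * pr
        forward = nxt
    return sum(forward.values())  # marginalize over the final context


# i.i.d. model: a single (empty) context.
iid = {("", "a", ""): 0.25, ("", "b", ""): 0.75}

# Markov model of order 1: the context is the previous character
# (the empty context is used only at the beginning).
markov = {("", "a", "a"): 0.5, ("", "b", "b"): 0.5,
          ("a", "a", "a"): 0.9, ("a", "b", "b"): 0.1,
          ("b", "a", "a"): 0.2, ("b", "b", "b"): 0.8}
```

With these tables, `string_probability(iid, "", "ab")` is $0.25 \cdot 0.75$ and `string_probability(markov, "", "aab")` is $0.5 \cdot 0.9 \cdot 0.1$, up to floating-point rounding.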

5.2 Basic PAA Concepts 

Probabilistic arithmetic automata (PAAs), as introduced in [13], are a generic 
concept useful to model probabilistic chains of operations. In this section, we sum up the 
definition and basic recurrences needed in this article. 

Definition 3 (Probabilistic Arithmetic Automaton, PAA). A probabilistic arithmetic 
automaton is a tuple $\mathcal{P} = \bigl(Q, q_0, T, \mathcal{V}, v_0, \mathcal{E}, \mu = (\mu_q)_{q \in Q}, \theta = (\theta_q)_{q \in Q}\bigr)$, where $Q$, $q_0$, $\mathcal{V}$, 
$v_0$, $\mathcal{E}$ and $\theta$ have the same meaning as for a DAA, each $\mu_q$ is a state-specific 
probability distribution on the emissions $\mathcal{E}$, and $T : Q \times Q \to [0, 1]$ is a transition function, 
such that $T(q, q')$ gives the probability of a transition from state $q$ to state $q'$, i.e. 
$(T(q, q'))_{q, q' \in Q}$ is a stochastic matrix. 

A PAA induces three stochastic processes: (1) the state process $(Q_t)_{t \in \mathbb{N}_0}$ with values 
in $Q$, (2) the emission process $(E_t)_{t \in \mathbb{N}_0}$ with values in $\mathcal{E}$, and (3) the value process $(V_t)_{t \in \mathbb{N}_0}$ 
with values in $\mathcal{V}$ such that $V_0 :\equiv v_0$ and $V_t := \theta_{Q_t}(V_{t-1}, E_t)$. 

We now restate the PAA recurrences from [13] to compute the state-value distribution 
after $t$ steps. For the sake of a shorter notation, we define $f_t(q, v) := \mathbb{P}(Q_t = q, V_t = v)$. 
Since we are generally only interested in the value distribution, note that it can be 
obtained by marginalization: $\mathbb{P}(V_t = v) = \sum_{q \in Q} f_t(q, v)$. 

Lemma 2 (State-value recurrence, [13]). 
The state-value distribution is given by $f_0(q, v) = 1$ if $q = q_0$ and $v = v_0$, and $f_0(q, v) = 0$ 
otherwise. For $t \ge 0$, 

$$f_{t+1}(q, v) = \sum_{q' \in Q}\ \sum_{(v', e) \in \theta_q^{-1}(v)} f_t(q', v') \cdot T(q', q) \cdot \mu_q(e), \tag{1}$$

where $\theta_q^{-1}(v)$ denotes the inverse image set of $v$ under $\theta_q$. 

The recurrence in Lemma 2 resembles the Forward recurrences known from HMMs. 

Note that the range of $V_t$ is finite for each $t$, even when $\mathcal{V}$ is infinite, as $V_t$ is a function 
of the states and emissions up to time $t$, and state set $Q$ and emission set $\mathcal{E}$ are finite. 
We define $\mathcal{V}_t := \mathrm{range}\, V_t$ and $\vartheta_n := \max_{0 \le t \le n} |\mathcal{V}_t|$. Clearly $\vartheta_n \le (|Q| \cdot |\mathcal{E}|)^n$. Therefore 
all actual computations are on finite sets. When analyzing the number of character 
accesses of a pattern matching algorithm, we have $\mathcal{V}_t \subseteq \{0, \ldots, m(n-m+1)\}$, as at 
most $(n - m + 1)$ search windows are processed, each causing at most $m$ character 
accesses. Thus, $\vartheta_n \in O(n \cdot m)$. 

5.3 Constructing a PAA from a DAA and a Text Model 

We now formally state how to combine a DAA and a text model into a PAA that allows 
us to compute the distribution of values produced by the DAA when processing a random 
text. 

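Lemma 3 (DAA + Text model → PAA). Let $(\mathcal{C}, c_0, \Sigma, \varphi)$ be a text model and 
$\mathcal{D} = \bigl(Q^{\mathcal{D}}, q_0^{\mathcal{D}}, \Sigma, \delta, \mathcal{V}, v_0, \mathcal{E}, (\eta_q)_{q \in Q^{\mathcal{D}}}, (\theta_q^{\mathcal{D}})_{q \in Q^{\mathcal{D}}}\bigr)$ be a DAA. Then, define 

• a state space $Q := Q^{\mathcal{D}} \times \mathcal{C}$, 

• a start state $q_0 := (q_0^{\mathcal{D}}, c_0)$, 

• transition probabilities 

$$T\bigl((q,c),(q',c')\bigr) := \sum_{\sigma \in \Sigma:\ \delta(q,\sigma) = q'} \varphi(c, \sigma, c'), \tag{2}$$

• (deterministic) emission probability vectors 

$$\mu_{(q,c)}(e) := \begin{cases} 1 & \text{if } e = \eta_q, \\ 0 & \text{otherwise,} \end{cases}$$

for all $(q, c) \in Q$, 

• operations $\theta_{(q,c)}(v, e) := \theta_q^{\mathcal{D}}(v, e)$ for all $(q, c) \in Q$. 

Then, $\mathcal{P} = \bigl(Q, q_0, T, \mathcal{V}, v_0, \mathcal{E}, \mu = (\mu_q)_{q \in Q}, \theta = (\theta_q)_{q \in Q}\bigr)$ is a PAA with 

$$\mathcal{L}(V_t) = \mathcal{L}\bigl(\mathrm{value}_{\mathcal{D}}(S_0 \ldots S_{t-1})\bigr)$$

for all $t \in \mathbb{N}_0$, where $S$ is a random text according to the text model $(\mathcal{C}, c_0, \Sigma, \varphi)$. States 
having zero probability of being reached from $q_0$ may be omitted from $Q$ and $T$. For 
such a PAA, the value distribution $\mathcal{L}(V_n)$ can be computed with $O(n \cdot |Q^{\mathcal{D}}| \cdot |\mathcal{C}|^2 \cdot |\Sigma| \cdot \vartheta_n)$ 
operations using $O(|Q^{\mathcal{D}}| \cdot |\mathcal{C}| \cdot \vartheta_n)$ space. If for all $c \in \mathcal{C}$ and $\sigma \in \Sigma$, there exists at most 
one $c' \in \mathcal{C}$ such that $\varphi(c, \sigma, c') > 0$, then the runtime is bounded by $O(n \cdot |Q^{\mathcal{D}}| \cdot |\mathcal{C}| \cdot |\Sigma| \cdot \vartheta_n)$. 

Proof. $\mathcal{P}$ is a PAA by Definition 3. As in Section 5.2, we define $f_t(q, v) := \mathbb{P}(Q_t = q, V_t = v)$. 
Iverson brackets are written $[\cdot]$, i.e. $[A] = 1$ if the statement $A$ is true and 
$[A] = 0$ otherwise. 
To prove $\mathcal{L}(V_t) = \mathcal{L}(\mathrm{value}_{\mathcal{D}}(S_0 \ldots S_{t-1}))$, we show that 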

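$$f_t\bigl((q^{\mathcal{D}}, c), v\bigr) = \sum_{s \in \Sigma^t} \bigl[\hat\delta\bigl((q_0^{\mathcal{D}}, v_0), s\bigr) = (q^{\mathcal{D}}, v)\bigr] \cdot \mathbb{P}(S_0 \ldots S_{t-1} = s,\ C_t = c) \tag{3}$$

for all $q^{\mathcal{D}} \in Q^{\mathcal{D}}$, $c \in \mathcal{C}$, $v \in \mathcal{V}$, and $t \in \mathbb{N}_0$. For $t = 0$, Equation (3) is correct 
by definitions of PAAs, DAAs and text models. For $t > 0$ we prove it inductively. 
Assume (3) to be correct for all $t'$ with $0 \le t' < t$. Then, writing $q := (q^{\mathcal{D}}, c)$, 

$$\begin{aligned} f_t(q, v) \tag{4} \\ &= \sum_{q' \in Q}\ \sum_{(v',e) \in \theta_q^{-1}(v)} f_{t-1}(q', v') \cdot T(q', q) \cdot \mu_q(e) && (5) \\ &= \sum_{q' \in Q}\ \sum_{(v',e) \in \mathcal{V} \times \mathcal{E}} \bigl[\theta_q(v', e) = v\bigr] \cdot f_{t-1}(q', v') \cdot T(q', q) \cdot \bigl[e = \eta_{q^{\mathcal{D}}}\bigr] && (6) \\ &= \sum_{q'^{\mathcal{D}} \in Q^{\mathcal{D}}}\ \sum_{c' \in \mathcal{C}}\ \sum_{(v',e) \in \mathcal{V} \times \mathcal{E}} \bigl[\theta^{\mathcal{D}}_{q^{\mathcal{D}}}(v', e) = v\bigr] \cdot \bigl[e = \eta_{q^{\mathcal{D}}}\bigr] \cdot f_{t-1}\bigl((q'^{\mathcal{D}}, c'), v'\bigr) \cdot \sum_{\sigma \in \Sigma} \bigl[\delta(q'^{\mathcal{D}}, \sigma) = q^{\mathcal{D}}\bigr] \cdot \varphi(c', \sigma, c) && (7) \\ &= \sum_{q'^{\mathcal{D}} \in Q^{\mathcal{D}}}\ \sum_{c' \in \mathcal{C}}\ \sum_{(v',e) \in \mathcal{V} \times \mathcal{E}}\ \sum_{\sigma \in \Sigma}\ \sum_{s \in \Sigma^{t-1}} \bigl[\theta^{\mathcal{D}}_{q^{\mathcal{D}}}(v', e) = v\bigr] \cdot \bigl[e = \eta_{q^{\mathcal{D}}}\bigr] \cdot \bigl[\delta(q'^{\mathcal{D}}, \sigma) = q^{\mathcal{D}}\bigr] \cdot \bigl[\hat\delta\bigl((q_0^{\mathcal{D}}, v_0), s\bigr) = (q'^{\mathcal{D}}, v')\bigr] \\ &\qquad \cdot\ \mathbb{P}(S_0 \ldots S_{t-2} = s,\ C_{t-1} = c') \cdot \varphi(c', \sigma, c) && (8) \\ &= \sum_{s\sigma \in \Sigma^t}\ \sum_{q'^{\mathcal{D}} \in Q^{\mathcal{D}}}\ \sum_{(v',e) \in \mathcal{V} \times \mathcal{E}} \bigl[\theta^{\mathcal{D}}_{q^{\mathcal{D}}}(v', e) = v\bigr] \cdot \bigl[e = \eta_{q^{\mathcal{D}}}\bigr] \cdot \bigl[\hat\delta\bigl((q_0^{\mathcal{D}}, v_0), s\bigr) = (q'^{\mathcal{D}}, v')\bigr] \cdot \bigl[\delta(q'^{\mathcal{D}}, \sigma) = q^{\mathcal{D}}\bigr] \\ &\qquad \cdot\ \mathbb{P}(S_0 \ldots S_{t-1} = s\sigma,\ C_t = c) && (9) \\ &= \sum_{s\sigma \in \Sigma^t} \bigl[\hat\delta\bigl((q_0^{\mathcal{D}}, v_0), s\sigma\bigr) = (q^{\mathcal{D}}, v)\bigr] \cdot \mathbb{P}(S_0 \ldots S_{t-1} = s\sigma,\ C_t = c). && (10) \end{aligned}$$

In the above derivation, step (4)→(5) follows from (1). Step (5)→(6) follows from the 
definitions of $\theta$ and $\mu$. Step (6)→(7) uses the definitions of $T$ and $Q$ in Lemma 3. Step 
(7)→(8) uses the induction assumption. Step (8)→(9) uses Lemma 1. The final step 
(9)→(10) follows by combining the four Iverson brackets summed over $q'^{\mathcal{D}}$ and $(v', e)$ 
into a single Iverson bracket. 

To compute the table $f_n$ containing $f_n(q, v)$ for all $q \in Q$ and $v \in \mathcal{V}_n$, we start with $f_0$ 
and perform $n$ update steps. The given runtime bounds can be verified by considering a 
"push" algorithm: When computing $f_{t+1}$, we initialize the table with zeros and iterate 
over all $q \in Q$, $v \in \mathcal{V}_t$ and $q' \in \{q'' \in Q : T(q, q'') > 0\}$; for each combination of $q$, $v$, 
and $q'$, we add $f_t(q, v) \cdot T(q, q')$ to $f_{t+1}(q', \theta_{q'}(v, \eta_{q'}))$. □ 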
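The "push" algorithm can be sketched in a few lines for the special case of Horspool's algorithm and an i.i.d. text model. With a single context, a PAA state coincides with a DAA state $(w, x)$, and only reachable state-value pairs are stored. This is an illustrative sketch, not an optimized implementation: the function names are hypothetical, the DAA is neither built explicitly nor minimized, and probabilities are plain floats.

```python
from collections import defaultdict


def horspool_cost(p, w):
    """Window cost for Horspool with right-to-left comparison."""
    m = len(p)
    for i in range(1, m + 1):
        if p[m - i] != w[m - i]:
            return i
    return m


def horspool_shift(p, w):
    """Horspool shift; depends only on the last window character."""
    m = len(p)
    right = p.rfind(w[-1], 0, m - 1)  # rightmost index in p[0..m-2], or -1
    return (m - 1) - right


def cost_distribution(p, char_probs, n):
    """Distribution of the total cost on an i.i.d. random text of length n."""
    m = len(p)
    f = {((p, m), 0): 1.0}  # f_0: all mass on (start state, start value)
    for _ in range(n):
        f_next = defaultdict(float)
        for ((w, x), v), prob in f.items():
            # Counter component of the successor state (same for every character).
            x2 = x - 1 if x > 0 else horspool_shift(p, w) - 1
            for a, pa in char_probs.items():
                w2 = w[1:] + a
                # The value is updated with the emission of the new state.
                v2 = v + (horspool_cost(p, w2) if x2 == 0 else 0)
                f_next[((w2, x2), v2)] += prob * pa  # push mass forward
        f = f_next
    dist = defaultdict(float)
    for (_state, v), prob in f.items():  # marginalize over states
        dist[v] += prob
    return dict(dist)
```

For the pattern "ab" and a uniform text of length 2 over {a, b}, the single window costs one access if it ends in "a" and two accesses if it ends in "b", so `cost_distribution("ab", {"a": 0.5, "b": 0.5}, 2)` assigns probability 0.5 to each of the costs 1 and 2.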

As a direct consequence of the above lemma and of the DAA construction from 
Section 4, we arrive at our main theorem. 

Theorem 2. Given a finite-memory text model $(\mathcal{C}, c_0, \Sigma, \varphi)$, a window-based pattern 
matching algorithm $A$, a pattern $p$ with $|p| = m$, and the functions $\mathrm{shift}^{A,p}$ and $\xi^{A,p}$, 
the cost distribution $\mathcal{L}(X^{A,p}_n)$ can be computed using $O(n^2 \cdot m \cdot |Q^{\mathcal{D}}| \cdot |\mathcal{C}|^2 \cdot |\Sigma|)$ time and 
$O(|Q^{\mathcal{D}}| \cdot |\mathcal{C}| \cdot n \cdot m)$ space. Since $|Q^{\mathcal{D}}|$ is bounded by $O(m \cdot |\Sigma|^m)$, the computation uses 
$O(n^2 \cdot m^2 \cdot |\Sigma|^{m+1} \cdot |\mathcal{C}|^2)$ time and $O(m^2 \cdot |\Sigma|^m \cdot |\mathcal{C}| \cdot n)$ space. If for all $c \in \mathcal{C}$ and $\sigma \in \Sigma$, 
there exists at most one $c' \in \mathcal{C}$ such that $\varphi(c, \sigma, c') > 0$, a factor of $|\mathcal{C}|$ can be dropped 
from the runtime bounds. 

Applying DAA minimization before the PAA construction results in considerable speed-ups in practice. Alternatively, algorithm-dependent construction schemes may be used to construct small automata. Tsai [20], for instance, gives algorithms to compute the asymptotic mean and variance of the number of comparisons used by Horspool's algorithm; for that, he constructs a Markov chain with $O(m^2)$ states. His construction can immediately be adapted to construct a DAA with $O(m^2)$ states.
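One way to picture DAA minimization is a Moore-style partition refinement in which states start out grouped by their emission. The sketch below is an illustration under stated assumptions, not necessarily the minimization procedure used by the authors: it assumes that all states use the same operation (addition), that `delta` is a total transition function given as a dictionary, and that unreachable states have already been removed. The function name `minimize_daa` is hypothetical.

```python
def minimize_daa(states, alphabet, delta, emission):
    """Return a dict mapping each state to the id of its equivalence class.

    states:   list of states
    alphabet: ordered iterable of characters (e.g. a string)
    delta:    dict mapping (state, character) -> successor state
    emission: dict mapping state -> emission
    Assumption: all states share the same operation, so two states are
    equivalent iff they emit the same value and their successors are
    equivalent for every character.
    """
    # Initial partition: states with equal emission share a block.
    labels = {q: emission[q] for q in states}
    num_blocks = len(set(labels.values()))
    while True:
        ids = {}
        new_labels = {}
        for q in states:
            # Signature: own block plus the blocks of all successors.
            signature = (labels[q],) + tuple(labels[delta[q, a]] for a in alphabet)
            new_labels[q] = ids.setdefault(signature, len(ids))
        if len(ids) == num_blocks:
            # No block was split in this round: the partition is stable.
            return new_labels
        labels, num_blocks = new_labels, len(ids)
```

Each round refines the previous partition, so the loop stops as soon as the number of blocks no longer grows. Faster refinement strategies exist; this version favors brevity.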

6 Comparing Algorithms with Difference DAAs 

Computing the cost distribution for two algorithms allows us to compare their perfor- 
mance characteristics. One natural question, however, cannot be answered by compar- 
ing these two (one-dimensional) distributions: What is the probability that algorithm A 
needs more text accesses than algorithm B to scan the same random text? The answer 
will depend on the correlation of algorithm performances: Do the same instances lead 
to long runtimes for both algorithms, or are there instances that are easy for one algorithm but difficult for the other? This section answers these questions by constructing a PAA to compute the distribution of cost differences of two algorithms. That is, we calculate the probability that algorithm A needs $v$ more text accesses than algorithm B for all $v \in \mathbb{Z}$.

We start by giving a general construction of a DAA that computes the difference of the sums of emissions of two given DAAs.

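Definition 4 (Difference DAA). Let $\mathcal{D}^1 = (Q^1, q_0^1, \Sigma, \delta^1, \mathcal{V}^1, v_0^1, \mathcal{E}^1, (\eta^1_q)_{q \in Q^1}, (\theta^1_q)_{q \in Q^1})$ and $\mathcal{D}^2 = (Q^2, q_0^2, \Sigma, \delta^2, \mathcal{V}^2, v_0^2, \mathcal{E}^2, (\eta^2_q)_{q \in Q^2}, (\theta^2_q)_{q \in Q^2})$ be DAAs given over the same alphabet $\Sigma$ with $\mathcal{V}^1 = \mathcal{V}^2 = \mathbb{N}$, $v_0^1 = v_0^2 = 0$, $\mathcal{E}^1, \mathcal{E}^2 \subseteq \mathbb{N}$, and with all operations being additions of previous value and current emission. The difference DAA is defined as

$\mathrm{DiffDAA}(\mathcal{D}^1, \mathcal{D}^2) := (Q, q_0, \Sigma, \delta, \mathcal{V}, v_0, \mathcal{E}, (\eta_q)_{q \in Q}, (\theta_q)_{q \in Q})$

where

• $Q := Q^1 \times Q^2$ and $q_0 := (q_0^1, q_0^2)$,

• $\mathcal{V} := \mathbb{Z}$ and $v_0 :=$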
Lemma 4. Let $\mathcal{D}^1$ and $\mathcal{D}^2$ be DAAs meeting the criteria given in Definition 4 and $\mathcal{D} := \mathrm{DiffDAA}(\mathcal{D}^1, \mathcal{D}^2)$. Then,

$\mathrm{value}_{\mathcal{D}}(s) = \mathrm{value}_{\mathcal{D}^1}(s) - \mathrm{value}_{\mathcal{D}^2}(s)$

for all $s \in \Sigma^*$.

Proof. Follows directly from Definition 4. □
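The product construction behind the difference DAA can be sketched as follows. The transition function and emissions of the product are not spelled out in the text above, so the sketch assumes the natural choice that is consistent with Lemma 4: both automata advance in parallel, and the emission of a product state is the difference of the two component emissions. The helper names `diff_daa` and `value`, and the tuple representation of a DAA, are hypothetical.

```python
def diff_daa(d1, d2, alphabet):
    """Build the reachable part of a difference DAA.

    A DAA is represented as (start, delta, eta) where delta maps
    (state, character) -> state and eta maps state -> integer emission.
    Assumptions: start values are 0 and all operations are additions.
    """
    start1, delta1, eta1 = d1
    start2, delta2, eta2 = d2
    start = (start1, start2)
    delta, eta = {}, {}
    stack, seen = [start], {start}
    while stack:  # explore only reachable product states
        q = stack.pop()
        q1, q2 = q
        eta[q] = eta1[q1] - eta2[q2]  # assumed emission of the product state
        for a in alphabet:
            nxt = (delta1[q1, a], delta2[q2, a])
            delta[q, a] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return start, delta, eta


def value(daa, s):
    """Sum of the emissions of the states entered while reading s."""
    q, delta, eta = daa
    v = 0
    for a in s:
        q = delta[q, a]
        v += eta[q]
    return v
```

Building only the reachable states keeps the product small, in line with the implementation advice given for the product automaton.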

Lemma 4 can now be applied to the DAAs constructed for the analysis of two algorithms as described in Section 4. Since the above construction builds the product of
both state spaces, it is advisable to minimize both DAAs before generating the product. 
Furthermore, in an implementation, only reachable states of the product automaton 
need to be constructed. Before being used to build a PAA (by applying Lemma [3]), the 
product DAA should again be minimized. 

As discussed in Section 5.2, at most $m(n - m + 1)$ character accesses can result from scanning a text of length $n$ for a pattern of length $m$. Thus, the difference of costs for two algorithms lies in the range $\{-m(n - m + 1), \ldots, m(n - m + 1)\}$ and, hence, $\vartheta_n \in O(n \cdot m)$.
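As a worked instance of this bound, consider a pattern of length $m = 6$ and a text of length $n = 100$, the setting used in the case studies. The numbers below follow directly from the formula above and are not measurements:

```latex
\[
  m(n - m + 1) = 6 \cdot (100 - 6 + 1) = 570,
  \qquad
  \bigl|\{-570, \ldots, 570\}\bigr| = 2 \cdot 570 + 1 = 1141 .
\]
```

The value set of the difference PAA therefore contains at most 1141 distinct cost differences in this setting.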

7 Case Studies 

In Section 2, we considered three practically relevant algorithms, namely Horspool's algorithm, backward oracle matching (BOM), and backward (non)-deterministic DAWG matching (B(N)DM). Now, we compare the distributions of running time costs of these algorithms for several patterns. Figure 1 shows these distributions for the patterns ATATAT and ACGTAC for text lengths 100 and 500 under a uniform i.i.d. model on the DNA alphabet {A,C,G,T}. For text length 500, the distributions for Horspool and B(N)DM resemble the shape of normal distributions. In fact, for Horspool's algorithm it has been proven that the distribution is asymptotically normal [TS]. For smaller text lengths (e.g. 100, as shown in the left column of Figure 1), the distributions are less smooth than for longer texts. It is remarkable that for BOM we find zero probabilities with a fixed period. In all examples shown in Figure 1, this period equals 7.

The probability that one pattern matching algorithm is faster than another depends on the pattern. Using the technique introduced in Section 6, we can quantify the strength of this effect. Figure 2 shows distributions of cost differences for different patterns and algorithms. The probability that the first algorithm is faster is represented by the area under the curve to the left of zero. For the pattern CGAAAA, for example, there is a 55.6% probability that Horspool's algorithm needs fewer character accesses than B(N)DM in uniform i.i.d. texts of length 100, while for ACGTAC, the probability is only 0.18%.
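Reading the "area left of zero" off a computed difference distribution amounts to summing the probabilities of all negative differences. The following sketch assumes the distribution is available as a dictionary from cost difference (first algorithm minus second) to probability; the function name and the toy numbers in the usage note are made up for illustration and are not results from this article.

```python
def compare_from_differences(diff_dist):
    """Split a cost-difference distribution into faster / tie / slower.

    diff_dist maps d = cost(first) - cost(second) to its probability.
    Returns (P(first faster), P(tie), P(first slower)).
    """
    faster = sum(p for d, p in diff_dist.items() if d < 0)
    tie = diff_dist.get(0, 0.0)
    slower = sum(p for d, p in diff_dist.items() if d > 0)
    return faster, tie, slower
```

For a toy distribution such as `{-2: 0.25, 0: 0.25, 3: 0.5}`, the first component of the result is the probability that the first algorithm needs fewer character accesses.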

Worth noting and perhaps surprising is the fact that there is a non-zero probability of BOM being faster than B(N)DM, although $\mathrm{shift}^{\mathrm{B(N)DM},p}(w) \geq \mathrm{shift}^{\mathrm{BOM},p}(w)$ for all window contents $w$. The explanation, of course, is that a shorter (say, first) shift for BOM leads to a different window content than for B(N)DM for the second window, which may have a larger shift value. This effect depends on the pattern: For the pattern CAAAAA, there is a 48.2% probability that BOM performs better than B(N)DM, while it is 6.2% for ACGTAC, again on texts of length 100.

Figure 1: Exact distributions of character access counts for patterns ATATAT (top) and ACGTAC (bottom) for text length 100 (left) and text length 500 (right). An i.i.d. text model with uniform character distribution is used.

To assess the effect of DAA minimization before constructing PAAs, we constructed minimized DAAs for all 21840 patterns of lengths 2 to 7 over $\Sigma$ = {A, C, G, T}. The minimum, average, and maximum state counts are shown in Table 1. For length 6, Figure 3 contains a detailed histogram. These statistics show that construction and minimization as given in this article lead to smaller automata (and thus better runtimes) than the constructions given in the conference version of this article [15]. It may be conjectured that the minimal state space grows only polynomially with $m$ for all of these algorithms, as has been previously proven for the Horspool algorithm [20].

A Java implementation is available at http://www.rahmannlab.de/software. All algorithms were run on an Intel Core 2 Quad CPU at 2.66GHz. Computing the distributions shown in Figure 1 took 0.3 to 0.6 seconds for each distribution. Distributions of differences as in Figure 2 were computed in 14 to 36 seconds.

Figure 2: Exact distributions of differences in character access counts for different patterns using a uniform character distribution as text model and random texts of length 100.



8 Discussion 

Using PAAs, we have shown how the exact distribution of the number of character accesses for window-based pattern matching algorithms can be computed. The framework is general enough to admit i.i.d. text models, Markovian text models of arbitrary order, and character-emitting hidden Markov models. The given construction results in an asymptotic runtime of $O(n^2 \cdot m \cdot |Q^{\mathcal{D}}| \cdot |C|^2 \cdot |\Sigma|)$. The number of DAA states $|Q^{\mathcal{D}}|$ is $O(m \cdot |\Sigma|^m)$, but it can be considerably reduced by DAA minimization. The resulting PAA is smaller, and therefore computing the cost distribution is much faster. If the pattern length $m$ is large, however, construction and minimization of the DAA itself pose a significant burden. It remains open whether there exists an algorithm to construct the minimal automaton directly in general, i.e. using only $O(|Q^{\mathcal{D}}_{\min}|)$ time. For Horspool's algorithm, an automaton with $O(m^2)$ states can be constructed by using the ideas from [20]. The construction, however, cannot be easily transferred to other algorithms and we are not aware of any similar results for BOM or B(N)DM.

Table 1: Comparison of DAA sizes for all patterns of length m over $\Sigma$ = {A, C, G, T}.

     States unminimized    States minimized (min. / avg. / max.)
m    |Σ|^m · (m+1)         Horspool            BOM                 B(N)DM
2    48                    4 / 4.8 / 5         4 / 4.0 / 4         4 / 4.8 / 5
3    256                   7 / 8.3 / 9         7 / 8.3 / 9         7 / 9.6 / 10
4    1280                  11 / 14.3 / 15      11 / 15.6 / 18      11 / 17.0 / 19
5    6144                  16 / 23.6 / 25      16 / 26.5 / 30      16 / 27.9 / 31
6    28672                 22 / 37.0 / 39      22 / 41.8 / 47      22 / 42.8 / 48
7    131072                29 / 55.2 / 58      29 / 62.4 / 70      29 / 62.6 / 70

Figure 3: Histogram on number of states of minimal DAAs over all patterns of length 6 over $\Sigma$ = {A, C, G, T}.

In this work, we considered the most practically relevant algorithms. Studying the cost distributions for some exemplary patterns showed that the algorithms' performance is indeed non-negligibly influenced by the choice of the pattern. The behavior of BOM especially deserves further attention: Its distribution of text character accesses features periodic zero probabilities, and unexpectedly, it may need fewer text accesses than B(N)DM on some patterns, although BOM's shift values are never better than B(N)DM's.

We focused on algorithms for single patterns, but the presented techniques also apply to algorithms that search for multiple patterns, like the Wu-Manber algorithm [21] or set backward oracle matching and multiple BNDM as given in [16]. A comparison of the resulting distributions could yield new insights into these algorithms as well.

Metrics other than text character accesses might be of interest and could be easily substituted; for example, one could count just the number of windows by defining $\xi^p(w) = 1$ for all $w \in \Sigma^m$.
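How a cost function can be swapped is easiest to see on a direct simulation of a single scan. The sketch below uses Horspool's algorithm with a pluggable per-window cost. It assumes right-to-left comparison that stops at the first mismatch and the standard Horspool shift table; all function names are hypothetical, and the simulation illustrates only the cost definition, not the PAA-based computation.

```python
def horspool_shifts(pattern):
    """Shift table: distance from the rightmost occurrence of each character
    among the first m-1 pattern positions to the last pattern position."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def comparisons(pattern, window):
    """Character accesses in one window: compare right to left and stop at
    the first mismatch (assumed comparison order)."""
    count = 0
    for p_char, w_char in zip(reversed(pattern), reversed(window)):
        count += 1
        if p_char != w_char:
            break
    return count


def one_per_window(pattern, window):
    """Alternative metric: every window costs exactly 1."""
    return 1


def total_cost(pattern, text, cost):
    """Sum the per-window cost over all windows visited by Horspool."""
    m = len(pattern)
    shifts = horspool_shifts(pattern)
    pos, total = 0, 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        total += cost(pattern, window)
        # Characters absent from the table shift the window by m.
        pos += shifts.get(window[-1], m)
    return total
```

Calling `total_cost(p, t, comparisons)` counts text character accesses, while `total_cost(p, t, one_per_window)` counts the number of windows for the same scan.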

The given constructions allow us to analyze an algorithm's performance for each pat- 
tern individually. While this is desirable for detailed analysis, the cost distribution 
resulting from randomly choosing text and pattern would also be of interest. 